32 Bpp Graphics Coding
Introduction
This tutorial is based on how my current vesa/gfx engine works. I'd previously been doing just 16 bpp graphics, and I had to code all of my routines for 16 bpp. When I first started with 16 bpp it was quite a novelty, but after a while I was getting irritated with it and wanted a more flexible model. I saw demos which could run in TONS of modes like 8 bpp, 15 bpp, 16 bpp, 24 bpp, 32 bpp and even textmode! A lot of demos could have their mode changed from the commandline, and I realised that this pure 16 bpp model of mine was not so cool and very inflexible.
What this tutorial covers is a different way of coding gfx engines so that they can handle multiple color depths. Basically what happens is that you create all your memory buffers as if they were holding 32 bpp graphics, and all of your internal graphics code works at the 32 bpp level, and then finally when you want to flip the frame to the screen you just convert to the appropriate bpp level. So you could have conversion functions to convert between 32 bpp -> 16 bpp and 32 bpp -> 8 bpp and then you would flip that into video memory.
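I also had a lot of trouble finding out the video mode number for 32 bpp modes. All the vesa docs I read only had up to 24 bpp. Eventually I found (from UNIVBE) that 320x200x32bpp is mode 146h. Mode numbers like this one aren't part of the standard VESA list, so they can differ between cards and drivers; it's safer to check the mode list your VBE returns than to hardcode it.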
I can't remember where I heard of this idea, but I do know that it's not original. In fact A LOT of demo groups use it. But since I couldn't find any tuts on it, and I thought it works very well, I wrote this tut. So let's go then.
32 bpp Basics
Although 32 bpp allows way more colours than the other modes (16 bpp etc.) it is actually the easiest to code for! 15 and 16 bpp modes are cool, but they only offer 32768 and 65536 colours, and they are difficult to work with because they have the RGB values packed into 5 and 6 bit fields.
The 32 bpp format is easy, and of course each pixel takes up 32 bits (4 bytes) of memory. You have to be careful because a 320x200 surface can take up a lot more memory than lesser modes.
320x200x32bpp - 256k / layer
320x200x24bpp - 192k / layer
320x200x16bpp - 128k / layer
320x200x15bpp - 128k / layer ;may as well use 16 bpp huh?
320x200x08bpp - 64k / layer
So only four 32 bpp layers and you are using a MEG of memory!
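If you want to work these sizes out in code, here is a minimal C sketch. The name layer_bytes is made up for illustration, and it assumes 15 bpp pixels are stored in 2 bytes just like 16 bpp.

```c
#include <stddef.h>

/* Bytes needed for one layer (surface) of the given size.
   Bits per pixel are rounded up to whole bytes, so 15 bpp
   costs 2 bytes per pixel, the same as 16 bpp. */
size_t layer_bytes(size_t width, size_t height, unsigned bpp)
{
    size_t bytes_per_pixel = (bpp + 7u) / 8u;
    return width * height * bytes_per_pixel;
}
```

For 320x200 this gives 256000 bytes at 32 bpp, which is the "256k" in the table above.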
Here is how the 4 bytes are structured:
[1 byte] [1 byte] [1 byte] [1 byte]
AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB
8 bits for the Alpha channel, 8 for red, 8 for green and 8 for blue. As you can see, you have the same range of RGB colours as you do in 24 bpp. So why use 32 bpp if it's just gonna take up more memory? Simple. First of all it's faster. Why would it be faster to read/write four bytes as opposed to three? Basically the CPU is fastest when it reads and writes whole dwords sitting on 4-byte boundaries. 32 bpp pixels always line up that way, 3-byte pixels don't. Also, you don't get 24 bit registers. For example:
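;24bpp clear screen
mov edi,[dest]
mov eax,[color] ;24bit color, with upper 8bits=0
mov ebx,eax
shr ebx,16 ;bl = third byte of the color (red)
mov ecx,64000 ;number of pixels for 320x200
@slowloop:
mov [edi],ax ;write 2 bytes (blue, green)
mov [edi+2],bl ;write 1 byte (red)
add edi,3
dec ecx
jnz @slowloop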
;32bpp clear screen
mov edi,[dest]
mov eax,[color] ;32bit color, with upper 8bits=0
mov ecx,64000 ;number of pixels for 320x200
rep stosd ;loop writing 32bits/time
Because there is no easy way of writing three bytes at a time, it's much easier to write four bytes. Hence 32 bpp modes.
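To make the layout concrete, here is a small C sketch of packing and unpacking a pixel, plus the 32 bpp clear as a plain loop. The function names are hypothetical. Note that on x86 (little-endian) the bytes land in memory in the order blue, green, red, alpha, even though the value reads AARRGGBB.

```c
#include <stddef.h>
#include <stdint.h>

/* Build one AARRGGBB pixel from its four 8 bit parts. */
uint32_t pack_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

/* Pull the individual channels back out. */
uint8_t argb_a(uint32_t c) { return (uint8_t)(c >> 24); }
uint8_t argb_r(uint32_t c) { return (uint8_t)(c >> 16); }
uint8_t argb_g(uint32_t c) { return (uint8_t)(c >> 8); }
uint8_t argb_b(uint32_t c) { return (uint8_t)c; }

/* 32bpp clear: one 32 bit store per pixel, like rep stosd. */
void clear32(uint32_t *dest, size_t pixels, uint32_t color)
{
    for (size_t i = 0; i < pixels; i++)
        dest[i] = color;
}
```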
Alpha Channel?
Well I must confess, at the time that I'm typing this I've never used the alpha channel, or really thought about what it could be used for... So I'm sort of gonna be making this up as I go along. But I'm sure you can think of groovy things to use it for. Having an extra 8 bits on your layers/surfaces is very handy indeed.
1. You could use it to define MANY characteristics of the surface pixels.
E.g.
A A A A A A A A
7 6 5 4 3 2 1 0 bits
| | | | | | | |
| | | | | | | |____Active
| | | | | | |
| | | | |_|_|
| | | | |________Draw style
| | | |
|_|_|_|
|_______________Percentage Transparent
!Active(0-1) - Whether the pixel is drawn/not.
Useful for images with holes in them.
Sort of like a built-in mask.
!Draw style(0-7) - How to draw the pixel.
eg, 0=normal(opaque)
1=additive
2=subtractive
3=multiplication
4=difference
5=transparent
6=?
7=?
!Percentage Transparent(0-15) - How transparent the pixel is.
So 15=fully opaque, and 0=invisible.
This is just an example of one way you could do things. Although I think a simplified version of the above would be better for the realtime demos of today.
2. You could keep things simple and just use the 8 alpha bits for doing your own internal transparency etc. This is probably what most people use it for. Very handy, but not something I've done myself.
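Both ideas can be sketched in a few lines of C. This is only an illustration under assumptions: the bit positions follow the example layout in option 1 (bit 0 active, bits 1-3 draw style, bits 4-7 transparency level), and blend8 is one common way to do the plain 8 bit alpha of option 2. All names are made up.

```c
#include <stdint.h>

/* Option 1: split the alpha byte into the three example fields. */
unsigned alpha_active(uint8_t a) { return a & 1u; }         /* bit 0    */
unsigned alpha_style(uint8_t a)  { return (a >> 1) & 7u; }  /* bits 1-3 */
unsigned alpha_level(uint8_t a)  { return (a >> 4) & 15u; } /* bits 4-7 */

/* Put the three fields back together into one alpha byte. */
uint8_t alpha_make(unsigned active, unsigned style, unsigned level)
{
    return (uint8_t)((active & 1u) | ((style & 7u) << 1) |
                     ((level & 15u) << 4));
}

/* Option 2: treat alpha as 0 (invisible) to 255 (opaque) and
   blend one 8 bit channel of src over dst, rounding to nearest. */
uint8_t blend8(uint8_t src, uint8_t dst, uint8_t alpha)
{
    unsigned v = src * (unsigned)alpha + dst * (255u - alpha) + 127u;
    return (uint8_t)(v / 255u);
}
```

You would call blend8 three times per pixel, once each for red, green and blue.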
More 32 bpp RGB
Ok, so now you know the format etc. Now to show you some nice things. Want to blend 2 RGB pixels together 50/50 (add them and divide by 2)? Sure, easy - not like 16 bpp.
;averaging 2 32 bit colors (each channel = col1/2 + col2/2)
mov eax,[col1]
mov ebx,[col2]
and eax,11111110_11111110_11111110_11111110b ;clear bit 0 of every byte so the
and ebx,11111110_11111110_11111110_11111110b ;shift can't leak into the byte below
shr eax,1
shr ebx,1
add eax,ebx
mov [edi],eax
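The same mask-and-shift trick reads like this in C. It's a sketch with a made-up function name; the mask clears the lowest bit of every byte so that the shift can't carry a bit from one channel into the next.

```c
#include <stdint.h>

/* 50/50 blend of two AARRGGBB pixels, all four bytes at once.
   Each byte of the result is (byte1 / 2) + (byte2 / 2). */
uint32_t average32(uint32_t c1, uint32_t c2)
{
    const uint32_t mask = 0xFEFEFEFEu; /* drop bit 0 of every byte */
    return ((c1 & mask) >> 1) + ((c2 & mask) >> 1);
}
```

Because each half is rounded down before adding, the result can be one lower than the exact average, which nobody will notice on screen.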
A very nice trick that I found was with MMX instructions. They have something which I found perfect for 32 bpp functions. I'm not about to write an MMX tutorial so go and read another doc for that, but I want to introduce one MMX feature in particular. Saturated registers.
Let's take a simple additive surface loop. Here you have 2 320x200x32bpp surfaces, both with pictures on them and you want to add them together. E.g:
//pseudo code
long col1,col2,colf;
col1=memget(blah); //32bit
col2=memget(blah2); //32bit
colf.r=col1.r+col2.r;
colf.g=col1.g+col2.g;
colf.b=col1.b+col2.b;
//but now instead of dividing by 2 as
//we do for transparency, we clip
// to 255;
if (colf.r>255) colf.r=255;
if (colf.g>255) colf.g=255;
if (colf.b>255) colf.b=255;
memput(blah3)=colf;
Doing that for every pixel would be VERY slow yes? Even doing that in normal ASM would be slowish. But MMX can make it easier. I use NASM, you should too.
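For reference, the pseudo code above could look like this as real C. It's a plain per-channel sketch, nothing clever, and the names are made up. The alpha byte is simply left at 0.

```c
#include <stdint.h>

/* Clip a channel sum to the 0-255 range. */
static uint32_t clip255(uint32_t v)
{
    return v > 255u ? 255u : v;
}

/* Additive blend of two AARRGGBB pixels: add red, green and blue
   separately and clip each one to 255 instead of letting it wrap. */
uint32_t add_clipped32(uint32_t c1, uint32_t c2)
{
    uint32_t r = clip255(((c1 >> 16) & 255u) + ((c2 >> 16) & 255u));
    uint32_t g = clip255(((c1 >> 8) & 255u) + ((c2 >> 8) & 255u));
    uint32_t b = clip255((c1 & 255u) + (c2 & 255u));
    return (r << 16) | (g << 8) | b;
}
```

That is three adds, three compares and a pile of shifts per pixel, which is why it crawls when you do it 64000 times a frame.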
MMX Helps out
"Saturated registers" is really about the instructions: a saturated add or subtract doesn't wrap around when it overflows. Normally if you added 250+20 in a byte value (say AL), at the end AL would = 14 (270 wraps around past 255). What MMX's saturated instructions do is clip the result instead. So when you do a saturated MMX add, 250+20 = 255. Funky eh? MMX works with 8 registers (MM0-MM7), each 64 bits wide. So you can store 2 32BPP pixels in each register! This is VERY cool because it means that using 1 instruction you can additively add 2 pairs of pixels.
Two MMX instructions which I have found handy are: PADDUSB & PSUBUSB
PADDUSB - Saturated ADD, unsigned, saturated at the byte level.
PSUBUSB - Saturated SUB, unsigned, saturated at the byte level.
Here are 2 MMX registers (64 bits each) filled with 2 pixels each:
[-------------------------------64 BITS-------------------------------]
[-------------32 BITS-------------] [-------------32 BITS-------------]
[----16 BITS----] [----16 BITS----] [----16 BITS----] [----16 BITS----]
[8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS] [8 BITS]
MM0: AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB
MM1: AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB AAAAAAAA RRRRRRRR GGGGGGGG BBBBBBBB
The MMX instruction PADDUSB MM0,MM1 basically adds each 8 bit segment, and clips the addition to 255. Same with PSUBUSB MM0,MM1 except that it clips it to 0. Here is how we could use this in a complete function. This function does the same as the above pseudo code, but MUCH quicker.
;ASM 32bpp MMX adding
mov edi,[dest]
mov esi,[src]
mov ecx,32000
@MMX_layeraddloop:
movq MM0,[edi] ;Move QUAD(64bits)
movq MM1,[esi] ;Move QUAD(64bits)
paddusb MM0,MM1 ;Saturated Add
movq [edi],MM0 ;Store the result back in dest
add esi,8
add edi,8
dec ecx
jnz @MMX_layeraddloop
EMMS ;Must always do this after a bout of
;MMX instructions, before any FPU code
You won't believe how fast this is until you try it.
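If you don't have an MMX machine handy, or just want to check your results, here is a scalar C model of what PADDUSB and PSUBUSB compute on one 64 bit value. This is not MMX code and it is nowhere near as fast; it only mimics the byte-by-byte clipping. The function names are made up.

```c
#include <stdint.h>

/* Model of PADDUSB: add 8 unsigned bytes, clip each sum to 255. */
uint64_t paddusb_model(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = (unsigned)((a >> (8 * i)) & 255u);
        unsigned y = (unsigned)((b >> (8 * i)) & 255u);
        unsigned s = x + y;
        if (s > 255u)
            s = 255u;
        out |= (uint64_t)s << (8 * i);
    }
    return out;
}

/* Model of PSUBUSB: subtract 8 unsigned bytes, clip each at 0. */
uint64_t psubusb_model(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = (unsigned)((a >> (8 * i)) & 255u);
        unsigned y = (unsigned)((b >> (8 * i)) & 255u);
        unsigned d = x > y ? x - y : 0u;
        out |= (uint64_t)d << (8 * i);
    }
    return out;
}
```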
Conversion
Ok, so you've written a groovy internal 32 bpp gfx library. Complete with texture-mapped four dimensional splines and beautiful particle algorithms. Now what? Well you have to copy your buffer into videomemory so that it can be seen. The nice thing is that the viewer doesn't have to have a videocard that can handle 32 bpp. You can convert the image in the buffer to the appropriate format and then flip. E.g:
if (vmode==_32bit) FLIPtoSCREEN32_(final.addr);
else
if (vmode==_text) {
convtxt_(final.addr,buffery.addr);
FLIPtoSCREENtxt_(buffery.addr);
} else
if (vmode==_8bit) {
conv8_(final.addr,buffery.addr);
FLIPtoSCREEN8_(buffery.addr);
} else
if (vmode==_16bit) {
conv16_(final.addr,buffery.addr);
FLIPtoSCREEN16_(buffery.addr);
}
A nice feature that I've added to my demo (which I'm busy writing) is that you can change videomodes while running the demo by pressing F1-F4. I thought this was quite a groovy idea.
Before I actually sat down to code my 32BPP engine, I thought it would be very slow to convert all the time. I mean one fullscreen color conversion MUST be slow. But it's not that bad. Why not? Ok, let's take the videomodes from the above code:
1. 32BPP - no conversion needed. Just a 256k flip.
2. 16BPP - conversion needed. But then just a 128k flip.
3. 8BPP - conversion needed. But then just a 64k flip.
4. text - conversion needed. But then just an 8k flip (80x50 cells, 2 bytes each).
As you can see, even though you have to convert, the amount of data you have to push to the video card becomes less, so it sort of compensates. And besides, the conversion routine ISN'T that costly. I actually love figuring out new ways (and faster ways) to convert between different pixel formats. It's FUN. Below are the algorithms that I use. If you use them please credit me and send me a little email. I don't claim that they are the best or anything, and if you can see kewl ways to improve them please give me a shout.
Converting to 24 bpp
Well, this should be very easy. Just chop off the ALPHA channel. So I'll leave this one up to you.
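In case you want a starting point, here is a minimal C sketch. It assumes the 24 bpp mode wants its bytes in the order blue, green, red (check your mode info, some cards differ), and conv24 is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* 32BPP->24BPP: copy the blue, green and red bytes of every
   AARRGGBB pixel and skip the alpha byte. */
void conv24(const uint32_t *src, uint8_t *dest, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        uint32_t c = src[i];
        *dest++ = (uint8_t)c;         /* blue  */
        *dest++ = (uint8_t)(c >> 8);  /* green */
        *dest++ = (uint8_t)(c >> 16); /* red   */
    }
}
```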
Converting to 16 bpp
Have fun trying to come up with your own methods. I think PTC has some nice conversion routines, although I have yet to check them out.
;32BPP->16BPP conversion(320x200)
proc conv16_ src,dest:dword
pushad
push edi
push esi
mov edi,[dest]
mov esi,[src]
mov ecx,64000
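@conv16_loop:
mov eax,[esi] ;eax = AARRGGBB
and eax,00000000111110001111110011111000b ;keep top 5 bits R, 6 bits G, 5 bits B
shr ah,2 ;ah = 00GGGGGG
shr ax,3 ;ax = 00000GGG GGGBBBBB
ror eax,8 ;al = 00000GGG, ah = RRRRR000
add al,ah ;al = RRRRRGGG
rol eax,8 ;ax = RRRRRGGG GGGBBBBB (5:6:5)
stosw ;write the 16 bit pixel
add esi,4 ;next 32 bit source pixel
dec ecx
jnz @conv16_loop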
pop esi
pop edi
popad
ret
endp conv16_
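The bit shuffling is easier to follow in C. This sketch does the same 5:6:5 packing for one pixel and for a whole buffer. The names are made up, and it assumes the 16 bpp mode really is 5:6:5 (a 15 bpp mode would need 5:5:5 instead).

```c
#include <stddef.h>
#include <stdint.h>

/* One AARRGGBB pixel to 16 bit RRRRRGGG GGGBBBBB. */
uint16_t to565(uint32_t c)
{
    uint32_t r = (c >> 16) & 255u;
    uint32_t g = (c >> 8) & 255u;
    uint32_t b = c & 255u;
    /* keep the top 5 bits of red, 6 of green and 5 of blue */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert a whole buffer, e.g. 64000 pixels for 320x200. */
void conv16_c(const uint32_t *src, uint16_t *dest, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++)
        dest[i] = to565(src[i]);
}
```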
Converting to 8 bpp (mode 13h)
Have fun trying to come up with your own methods. I think PTC has some nice conversion routines, although I have yet to check them out. This function doesn't take the palette into account. In fact, all it does is assume you've set your palette to go from 0 (black) to 255 (white), and then it finds the approximate brightness of the RGB values and uses that as the color index. I know it's lame, but I've seen other demos doing the same thing. Oh well, I'm sure I'll write a color palette version very soon, as I only adopted this 32 bpp internal mode about two weeks ago.
;32BPP->8BPP conversion(320x200)
proc conv8_ src,dest:dword
pushad
push edi
push esi
mov edi,[dest]
mov esi,[src]
mov ecx,64000
@conv8_loop:
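mov ebx,[esi] ;ebx = AARRGGBB
movzx eax,bl ;eax = blue
shr ebx,8 ;bl = green
add al,bl
adc ah,0 ;eax = blue+green
shr ebx,8 ;bl = red
add al,bl
adc ah,0 ;eax = blue+green+red (0-765)
shr eax,2 ;approximate brightness (0-191)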
stosb
add esi,4
dec ecx
jnz @conv8_loop
pop esi
pop edi
popad
ret
endp conv8_
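Here is the same brightness idea as a C sketch, with made-up names. to_grey8 divides the sum by 4 with a shift, so the result only reaches 191. to_grey8_full is a variation (not what the routine above does) that divides by 3 to use the whole 0-255 grey palette, at the cost of a real divide.

```c
#include <stdint.h>

/* Sum of red, green and blue of an AARRGGBB pixel (0-765). */
static unsigned rgb_sum(uint32_t c)
{
    return ((c >> 16) & 255u) + ((c >> 8) & 255u) + (c & 255u);
}

/* Cheap brightness: sum / 4, giving palette indices 0-191. */
uint8_t to_grey8(uint32_t c)
{
    return (uint8_t)(rgb_sum(c) >> 2);
}

/* Variation: sum / 3, giving palette indices 0-255. */
uint8_t to_grey8_full(uint32_t c)
{
    return (uint8_t)(rgb_sum(c) / 3u);
}
```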
Converting to TextMode
Hmmm, this was hard. Hehe, it's amazing that with these graphics modes, it seems to get easier the more colors you have. I mean 32 bpp is dead easy to code, 16 bpp is harder, and textmode is quite a mission. This is a VERY simple hack, and if you can make a better one, please let me know all about it. This one just writes character #176, #177, #178 or #219 to the screen depending on the brightness of the RGB value. And it also selects the color (0-15) based on the "brightness" of the RGB value. So it assumes that your palette goes from dark to bright. It samples every 4th pixel of every 4th row, so the 320x200 buffer maps onto 80x50 characters. Unfortunately I haven't got as far as funky things like changing the palette in realtime or searching for the best color. I'll probably do this soon. This is basically just a test:
;32BPP->Textmode conversion (80x50)
proc convtxt_ src,dest:dword
pushad
push edi
push esi
mov edi,[dest]
mov esi,[src]
mov edx,50
@convtxt_loopy:
mov ecx,80
@convtxt_loopx:
mov ebx,[esi] ;ebx = AARRGGBB
movzx eax,bl ;eax = blue
shr ebx,8 ;bl = green
add al,bl
adc ah,0 ;eax = blue+green
shr ebx,8 ;bl = red
add al,bl
adc ah,0 ;eax = blue+green+red (0-765)
shr eax,2 ;approximate brightness (0-191)
mov ah,al ;keep a copy for the color
mov bl,0 ;brightness 0: blank character
cmp al,0
je @ascout
mov bl,176 ;1-47: light shade
cmp al,48
jb @ascout ;unsigned compares, al can be above 127
mov bl,177 ;48-95: medium shade
cmp al,96
jb @ascout
mov bl,178 ;96-143: dark shade
cmp al,144
jb @ascout
mov bl,219 ;144 and up: solid block
@ascout:
shr ah,4
mov al,bl
stosw
add esi,16
dec ecx
jnz @convtxt_loopx
add esi,3840
dec edx
jnz @convtxt_loopy
pop esi
pop edi
popad
ret
endp convtxt_
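As a cross-check, here is the text conversion sketched in C. The names are made up, and it assumes the same things as the asm: a 320x200 source, an 80x50 text screen, one sample per 4x4 block, character in the low byte and color attribute in the high byte of each cell.

```c
#include <stdint.h>

/* One AARRGGBB pixel to a text cell (attribute << 8 | character). */
uint16_t to_text_cell(uint32_t c)
{
    unsigned v = (((c >> 16) & 255u) + ((c >> 8) & 255u) + (c & 255u)) >> 2;
    unsigned ch;

    if (v == 0)        ch = 0;   /* blank        */
    else if (v < 48)   ch = 176; /* light shade  */
    else if (v < 96)   ch = 177; /* medium shade */
    else if (v < 144)  ch = 178; /* dark shade   */
    else               ch = 219; /* solid block  */

    /* color 0-11, taken from the top 4 bits of the brightness */
    return (uint16_t)(((v >> 4) << 8) | ch);
}

/* 320x200x32bpp to 80x50 text: sample every 4th pixel of every 4th row. */
void convtxt_c(const uint32_t *src, uint16_t *dest)
{
    for (int y = 0; y < 50; y++)
        for (int x = 0; x < 80; x++)
            dest[y * 80 + x] = to_text_cell(src[(y * 4) * 320 + x * 4]);
}
```

A nicer version would average the 16 pixels in each 4x4 block instead of just grabbing the top left one.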
Closing Words
Phew. I really hope this helps some people out there, in some way or another. Please send me any thoughts/ideas/improvements on this topic, I'd really like to hear/see them. The scene is wonderful, long live the scene. When I die I want to go to a scene heaven.
-Rawhed/Sensory Overload
-Mailto:andrew@overload.co.za
-Http://www.overload.co.za
-Andrew Griffiths
-South Africa
-05-07-1999